Data Import

R4DS 08 - Data Import with readr

lruolin
05-14-2021

R4DS Practice 08: Data Import

The code below is from the practice exercises in https://r4ds.had.co.nz/, with reference to the solutions at https://jrnold.github.io/r4ds-exercise-solutions/.

Let’s begin now

Loading the tidyverse package (readr is loaded as part of the tidyverse):
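library(tidyverse)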

Introduction

This chapter is about importing data (plain text rectangular files) into R.

A cheatsheet is available at: https://readr.tidyverse.org/

Importing a csv file:

It is important to define the path to the file to read. The data below is a dataset from tidytuesday, and was scraped from fastfoodnutrition.org.

# Importing fast food data from tidytuesday (scraped from fastfoodnutrition.org)

calories <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-09-04/fastfood_calories.csv")

glimpse(calories)
Rows: 515
Columns: 18
$ X1          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonald…
$ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Baco…
$ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 77…
$ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400,…
$ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 2…
$ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.…
$ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5…
$ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40,…
$ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 10…
$ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, …
$ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, …
$ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 1…
$ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, …
$ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10…
$ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15…
$ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 3…
$ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Ot…

To read a file where the fields are separated by “|”, use the read_delim() function, and specify the delimiter.

For read_csv(), you can also use trim_ws to trim the whitespace around cells, set the locale, specify which strings to treat as NA, and control whether a progress bar is shown.
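A minimal sketch of both (the file names here are hypothetical):

# Reading a file where fields are separated by "|"
read_delim("measurements.txt", delim = "|")

# Some of the optional read_csv() arguments
read_csv("measurements.csv",
         trim_ws = TRUE,                       # trim whitespace around fields
         na = c("", "NA"),                     # strings treated as missing
         locale = locale(decimal_mark = "."),  # parsing conventions
         progress = FALSE)                     # hide the progress bar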

To read the following, where the values are quoted with backticks, specify the quote argument:

x <- "x, y\n1, `a, b`"
read_csv(x, quote = "`")
# A tibble: 1 x 2
      x y    
  <dbl> <chr>
1     1 a, b 

Identify what is wrong with each of the following inline CSV files:

read_csv("a, b\n1,2,3\n4,5,6")
# A tibble: 2 x 2
      a     b
  <dbl> <dbl>
1     1     2
2     4     5
# only two column headers are specified, but each row has three values; the third value in each row is dropped
read_csv("a,b,c\n1,2\n1,2,3,4")
# A tibble: 2 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2    NA
2     1     2     3
# the number of values per row does not match the number of headers: row 1 is padded with NA, and the extra value in row 2 is dropped
read_csv("a,b\n\"1")
# A tibble: 1 x 2
      a b    
  <dbl> <chr>
1     1 <NA> 
# the opening quotation mark is never closed, so the quoted value is incomplete and column b is filled with NA
read_csv("a,b\n1,2\na,b")
# A tibble: 2 x 2
  a     b    
  <chr> <chr>
1 1     2    
2 a     b    
# the last row repeats the header values, so both columns are parsed as character
read_csv("a;b\n1;3") # delimiter is ;, use read_csv2
# A tibble: 1 x 1
  `a;b`
  <chr>
1 1;3  
read_csv2("a;b\n1;3")
# A tibble: 1 x 2
      a     b
  <dbl> <dbl>
1     1     3

Parsing a vector

The important arguments to locale are:

locale(date_names = "en", date_format = "%AD", time_format = "%AT", decimal_mark = ".", grouping_mark = ",", tz = "UTC", encoding = "UTF-8", asciify = FALSE)

Arguments:

date_names
Character representations of day and month names. Either the language code as string (passed on to date_names_lang()) or an object created by date_names().

date_format, time_format
Default date and time formats.

decimal_mark, grouping_mark
Symbols used to indicate the decimal place, and to chunk larger numbers. The decimal mark can only be "," or ".".

tz
Default tz. This is used both for input (if the time zone isn’t present in individual strings), and for output (to control the default display). The default is to use “UTC”, a time zone that does not use daylight savings time (DST) and hence is typically most useful for data. The absence of time zones makes it approximately 50x faster to generate UTC times than any other time zone.

Use "" to use the system default time zone, but beware that this will not be reproducible across systems.

For a complete list of possible time zones, see OlsonNames(). Americans, note that “EST” is a Canadian time zone that does not have DST. It is not Eastern Standard Time. It’s better to use “US/Eastern”, “US/Central” etc.

encoding
Default encoding. This only affects how the file is read - readr always converts the output to UTF-8.

asciify
Should diacritics be stripped from date names and converted to ASCII? This is useful if you're dealing with ASCII data where the correct spellings have been lost. Requires the stringi package.
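A few quick illustrations of locale() in action (the input strings are made up for demonstration):

# comma as the decimal mark, as used in much of Europe
parse_double("1,23", locale = locale(decimal_mark = ","))
[1] 1.23
# period as the grouping mark
parse_number("123.456.789", locale = locale(grouping_mark = "."))
[1] 123456789
# French month names
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
[1] "2015-01-01"
# default time zone for input and display
parse_datetime("2021-05-14 09:00", "%Y-%m-%d %H:%M",
               locale = locale(tz = "US/Eastern"))
[1] "2021-05-14 09:00:00 EDT"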

Generate the correct format string to parse each of the following dates and times:

d1 <- "January 1, 2010"

parse_date(d1, "%B %d, %Y")
[1] "2010-01-01"
#
d2 <- "2015-Mar-07"

parse_date(d2, "%Y-%b-%d")
[1] "2015-03-07"
#
d3 <- "06-Jun-2017"

parse_date(d3, "%d-%b-%Y")
[1] "2017-06-06"
#
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
[1] "2015-08-19" "2015-07-01"
#
d5 <- "12/03/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
[1] "2014-12-03"
#
t1 <- "1705"
parse_time(t1, "%H%M")
17:05:00
#
t2 <- "11:15:10.12 PM"
parse_time(t2, "%H:%M:%OS %p")
23:15:10.12

datapasta

I learnt about this package called datapasta, which allows you to copy and paste data from Excel or Wikipedia directly into R, without saving it as a separate Excel file first. This website gives a good summary of the functions. In short, you just have to select the text and paste it into R. Lovely!
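For example, after copying a table to the clipboard (tribble_paste() is one of datapasta's paste helpers):

# Pastes the clipboard contents at the cursor as a ready-made tribble() call
datapasta::tribble_paste()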

I copied the data below from the SFA website using the datapasta package, but I had to modify some of the text arrangement, as a comma was used as the thousands separator; I also added a Y prefix to the years so that they would be recognized as text.

library(datapasta)

chicken_consumption <- tribble(
  ~Item, ~Y2010, ~Y2011, ~Y2012, ~Y2013, ~Y2014, ~Y2015, ~Y2016, ~Y2017, ~Y2018, ~Y2019,
  ## -- ##
  "Chicken Consumption (tonnes)", 161460, 170315, 173521, 172054, 170926, 179122, 195290, 167588, 193142, 193903
)

glimpse(chicken_consumption)
Rows: 1
Columns: 11
$ Item  <chr> "Chicken Consumption (tonnes)"
$ Y2010 <dbl> 161460
$ Y2011 <dbl> 170315
$ Y2012 <dbl> 173521
$ Y2013 <dbl> 172054
$ Y2014 <dbl> 170926
$ Y2015 <dbl> 179122
$ Y2016 <dbl> 195290
$ Y2017 <dbl> 167588
$ Y2018 <dbl> 193142
$ Y2019 <dbl> 193903
chicken_consumption %>% 
  pivot_longer(cols = starts_with("Y"),
               names_to = "Year",
               values_to = "Tonnes") %>% 
  select(Year, Tonnes) %>% 
  mutate(year_number = parse_number(Year)) %>% 
  ggplot(aes(x = year_number, y = Tonnes)) +
  geom_line(col = "brown") +
  geom_point(size = 3, col = "brown") +
  labs( title = "Chicken Consumption (tonnes) in Singapore",
        subtitle = "Chicken consumption has generally increased, except for a dip in 2017.",
        caption = "Source: Singapore Food Agency") +
  scale_x_continuous(limits = c(2010, 2019), n.breaks = 10) +
  theme_classic()

What could have resulted in the drop in chicken consumption in 2017? It could be due to the avian flu outbreak as reported in Today.

Reference

https://r4ds.had.co.nz/

https://jrnold.github.io/r4ds-exercise-solutions/

https://rpubs.com/LaurynKeller/662242

Citation

For attribution, please cite this work as

lruolin (2021, May 14). pRactice corner: Data Import. Retrieved from https://lruolin.github.io/myBlog/posts/20210512_Tidyverse Chap 8 - Import/

BibTeX citation

@misc{lruolin2021data,
  author = {lruolin, },
  title = {pRactice corner: Data Import},
  url = {https://lruolin.github.io/myBlog/posts/20210512_Tidyverse Chap 8 - Import/},
  year = {2021}
}